# See the modeldata package for new datasets
library(ranger) # for random forests
library(tidyverse) # for graphing and data cleaning
library(tidymodels) # for modeling
library(stacks) # for stacking models
library(naniar) # for examining missing values (NAs)
library(lubridate) # for date manipulation
library(moderndive) # for King County housing data
library(vip) # for variable importance plots
library(DALEX) # for model interpretation
library(DALEXtra) # for extension of DALEX
library(patchwork) # for combining plots nicely
library(kknn) # for k-nearest neighbors
theme_set(theme_minimal()) # Lisa's favorite theme
data("lending_club")
# Data dictionary (as close as I could find): https://www.kaggle.com/wordsforthewise/lending-club/discussion/170691
When you finish the assignment, remove the # from the options chunk at the top, so that messages and warnings aren’t printed. If you are getting errors in your code, add error = TRUE so that the file knits. I would recommend not removing the # until you are completely finished.
From now on, GitHub should be part of your routine when doing assignments. I recommend making it part of your process anytime you are working in R, but I’ll make you show it’s part of your process for assignments.
Task: When you are finished with the assignment, post a link below to the GitHub repo for the assignment. If you want to post it to your personal website, that’s ok (not required). Make sure the link goes to a spot in the repo where I can easily find this assignment. For example, if you have a website with a blog and post the assignment as a blog post, link to the post’s folder in the repo. As an example, I’ve linked to my GitHub stacking material here.
Jackson’s Link: https://github.com/jacksontak/assignment2
Before jumping into these problems, you should read through (and follow along with!) the model stacking and global model interpretation tutorials on the Course Materials tab of the course website.
We’ll be using the lending_club dataset from the modeldata library, which is part of tidymodels. The data dictionary they reference doesn’t seem to exist anymore, but it seems the one on this kaggle discussion is pretty close. It might also help to read a bit about Lending Club before starting in on the exercises.
The outcome we are interested in predicting is Class. According to the dataset’s help page, its values are “either ‘good’ (meaning that the loan was fully paid back or currently on-time) or ‘bad’ (charged off, defaulted, or 21-120 days late)”.
Tasks: I will be expanding these, but this gives a good outline.
1. Explore the data, concentrating on examining distributions of variables and examining missing values.
# distributions of the quantitative variables
lending_club %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(variable),
             scales = "free")
Based on the distribution plots above, many of the quantitative variables are heavily right-skewed. This suggests log-transforming them rather than modeling them on their original scales. We also see some outliers in the annual_inc and int_rate variables.
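As a quick check of whether a log transformation tames the skew, here is a sketch using annual_inc (the +1 offset is an assumption to guard against zero incomes):

```r
# Histogram of annual_inc on the log scale; compare to the raw-scale
# facet above to see whether the skew is reduced
lending_club %>%
  ggplot(aes(x = log(annual_inc + 1))) +
  geom_histogram(bins = 30) +
  labs(x = "log(annual_inc + 1)")
```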
# distributions of the categorical variables
lending_club %>%
  select(where(is.factor)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(vars(variable),
             scales = "free")
For the categorical variables, some states contribute far more observations than others. We also see that Class is quite imbalanced, with many more “good” observations than “bad” ones. For the emp_length variable, one level appears far more often than the rest, standing out in the plot.
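To put a number on the Class imbalance seen in the bar plot, a quick tabulation (counts and proportions):

```r
# Count each outcome level and compute its share of the data
lending_club %>%
  count(Class) %>%
  mutate(prop = n / sum(n))
```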
# examination of missing values
lending_club %>%
  add_n_miss() %>%
  count(n_miss_all)

colSums(is.na(lending_club))
## funded_amnt term
## 0 0
## int_rate sub_grade
## 0 0
## addr_state verification_status
## 0 0
## annual_inc emp_length
## 0 0
## delinq_2yrs inq_last_6mths
## 0 0
## revol_util acc_now_delinq
## 0 0
## open_il_6m open_il_12m
## 0 0
## open_il_24m total_bal_il
## 0 0
## all_util inq_fi
## 0 0
## inq_last_12m delinq_amnt
## 0 0
## num_il_tl total_il_high_credit_limit
## 0 0
## Class
## 0
There seem to be no missing values in our data set.
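Since naniar is already loaded, miss_var_summary() gives a per-variable confirmation of this, which scales better than scanning the colSums() output above:

```r
# One row per variable with its NA count and percent missing;
# every row should show 0 here
lending_club %>%
  miss_var_summary()
```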
2. Do any data cleaning steps that need to happen before the model is built. For example, you might remove any variables that mean the same thing as the response variable (not sure if that happens here), get rid of rows where all variables have missing values, etc.
Be sure to add more “bad” Classes. This is not the best solution, but it will work for now. (Should investigate how to appropriately use the step_upsample() function from themis.)
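The resample-with-replacement approach below works, but themis can do the balancing inside the recipe, so it only ever touches training data. A sketch (the over_ratio = 1 choice, upsampling “bad” until the classes are equal, is an assumption):

```r
library(themis) # for recipe-based class balancing

# Upsample the minority class as a recipe step; unlike manual
# bind_rows(), this is baked into the preprocessing pipeline
balanced_recipe <- recipe(Class ~ ., data = lending_club) %>%
  step_upsample(Class, over_ratio = 1)
```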
create_more_bad <- lending_club %>%
filter(Class == "bad") %>%
sample_n(size = 3000, replace = TRUE)
lending_club_mod <- lending_club %>%
bind_rows(create_more_bad) %>%
# remove zero variance and near zero variance
select(-delinq_amnt, -acc_now_delinq)
3. Split the data into training and test sets, using a proportion of .75 and the seed 494.

set.seed(494) # for reproducibility

# split the data into training and test
lc_split <- initial_split(lending_club_mod,
                          prop = .75)
lc_training <- training(lc_split)
lc_testing <- testing(lc_split)
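Given the Class imbalance, a stratified split would keep the good/bad ratio consistent across training and test sets. A variant of the split above (same seed; the strata argument is the only change):

```r
set.seed(494) # same seed for reproducibility

# Stratify on the outcome so both sets keep the same class ratio
lc_split_strat <- initial_split(lending_club_mod,
                                prop = .75,
                                strata = Class)
lc_training_strat <- training(lc_split_strat)
lc_testing_strat <- testing(lc_split_strat)
```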
4. Set up the recipe and the pre-processing steps to build a lasso model. Some steps you should take:

Make all integer variables numeric (use step_mutate_at() or this will be a lot of code). We’ll want to do this for the model interpretation we’ll do later. Also group factor variables with many levels, make all categorical variables into dummy variables, and normalize the quantitative variables.

lc_recipe <- recipe(Class ~ .,
                    data = lc_training) %>%
  # make all variables numeric
  step_mutate_at(all_numeric(), fn = ~as.numeric(.)) %>%
  # group factor variables with many levels: there aren't any
  # make all categorical variables into dummy variables
  step_dummy(all_nominal(), -all_outcomes()) %>%
  # normalize quantitative predictors
  step_normalize(all_predictors(),
                 -all_nominal(),
                 -has_role(match = "evaluative"))
lc_recipe %>%
  prep(lc_training) %>%
  juice()
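With the recipe prepared, the lasso itself can be specified with the glmnet engine via parsnip. A sketch (the penalty is left to be tuned later; mixture = 1 makes it a pure lasso rather than a mix of L1 and L2 penalties):

```r
# Lasso = L1-penalized logistic regression for the binary Class outcome
lasso_spec <- logistic_reg(penalty = tune(),
                           mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# Bundle the recipe and model into a workflow for tuning
lasso_wf <- workflow() %>%
  add_recipe(lc_recipe) %>%
  add_model(lasso_spec)
```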